Hello Everyone!
Welcome to one of my first projects: a Breast Cancer Predictor using Logistic Regression. My name is Ignacio Antequera, and in this notebook I explore the Breast Cancer dataset and develop a Logistic Regression model to classify suspected cells as benign or malignant.
This notebook proceeds from data exploration and cleaning, through feature selection, to model fitting and evaluation.
In this notebook, I incorporate visualizations wherever they aid comprehension. I trust you will find this notebook engaging, and I encourage you to share your questions or feedback in the comments section below. As a person always willing to keep learning, I appreciate any feedback, as it helps me identify errors and gain fresh insights.
Without further ado, let's delve into the data!
Extracted from the UCI ML repository
Columns 3 to 32
Ten real-valued features are computed for each cell nucleus:
The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
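The 30 feature columns follow a simple naming convention, which we can sketch in a few lines of Python (a small illustration, assuming the underscore-joined names used throughout this notebook):

```python
# Sketch of the naming convention: 10 base attributes x 3 statistics
# gives the 30 feature columns (e.g. radius_mean, radius_se, radius_worst).
base = ["radius", "texture", "perimeter", "area", "smoothness",
        "compactness", "concavity", "concave points", "symmetry",
        "fractal_dimension"]
stats = ["mean", "se", "worst"]
columns = [f"{b}_{s}" for s in stats for b in base]

print(len(columns))                               # 30
print(columns[0], columns[10], columns[20])       # radius_mean radius_se radius_worst
```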
# Data cleaning and manipulation
import pandas as pd # Pandas for data handling
import numpy as np # NumPy for numerical operations
# Data visualization
import matplotlib.pyplot as plt # Matplotlib for basic plotting
import seaborn as sns # Seaborn for enhanced data visualization
# Machine learning
from sklearn.preprocessing import StandardScaler # For feature scaling
import sklearn.linear_model as skl_lm # Scikit-learn's linear models
from sklearn import preprocessing # Preprocessing utilities
from sklearn import neighbors # K-nearest neighbors
from sklearn.metrics import confusion_matrix, classification_report, precision_score # Model evaluation metrics
from sklearn.model_selection import train_test_split # Splitting data for training and testing
import statsmodels.api as sm # Statsmodels for statistical analysis
import statsmodels.formula.api as smf # Statsmodels formula API for modeling
# Initialize some package settings for data visualization
sns.set(style="whitegrid", color_codes=True, font_scale=1.3)
# Magic command to display Matplotlib plots inline in Jupyter Notebook
%matplotlib inline
# Read in the data from 'project_data.csv' and set the first column as the index
df = pd.read_csv('project_data.csv', index_col=0)
# Display the first 5 rows of the DataFrame to get an initial look at the data
df.head()
| id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 842302 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | NaN |
| 842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | NaN |
| 84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | NaN |
| 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | NaN |
| 84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | NaN |
5 rows × 32 columns
The last column, Unnamed: 32, seems to contain nothing but missing values. Let's quickly check the other columns for missing values as well.
# General summary of the dataframe
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 569 entries, 842302 to 92751
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   diagnosis                569 non-null    object
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 14  area_se                  569 non-null    float64
 15  smoothness_se            569 non-null    float64
 16  compactness_se           569 non-null    float64
 17  concavity_se             569 non-null    float64
 18  concave points_se        569 non-null    float64
 19  symmetry_se              569 non-null    float64
 20  fractal_dimension_se     569 non-null    float64
 21  radius_worst             569 non-null    float64
 22  texture_worst            569 non-null    float64
 23  perimeter_worst          569 non-null    float64
 24  area_worst               569 non-null    float64
 25  smoothness_worst         569 non-null    float64
 26  compactness_worst        569 non-null    float64
 27  concavity_worst          569 non-null    float64
 28  concave points_worst     569 non-null    float64
 29  symmetry_worst           569 non-null    float64
 30  fractal_dimension_worst  569 non-null    float64
 31  Unnamed: 32              0 non-null      float64
dtypes: float64(31), object(1)
memory usage: 146.7+ KB
It looks like our data does not contain any missing values, except for our suspect column Unnamed: 32, which is full of missing values. Let's go ahead and remove this column entirely. After that, let's check for the data type of each column.
# remove the 'Unnamed: 32' column
df = df.drop('Unnamed: 32', axis=1)
# check the data type of each column
df.dtypes
diagnosis                   object
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
dtype: object
Our response variable, diagnosis, is categorical and has two classes, 'B' (Benign) and 'M' (Malignant). All explanatory variables are numerical, so we can skip data type conversion.
Let's now take a closer look at our response variable, since it is the main focus of our analysis. We begin by checking out the distribution of its classes.
# Visualize the distribution of classes
plt.figure(figsize=(8, 4))
# Use the 'x' parameter explicitly for specifying the data
sns.countplot(x='diagnosis', data=df, palette='RdBu')
# Count the number of observations in each class
# (value_counts() sorts by count in descending order, so benign comes first)
benign, malignant = df['diagnosis'].value_counts()
# Display the counts and percentages
print('Number of cells labeled Benign: ', benign)
print('Number of cells labeled Malignant: ', malignant)
print('')
print('% of cells labeled Benign:', round(benign / len(df) * 100, 2), '%')
print('% of cells labeled Malignant:', round(malignant / len(df) * 100, 2), '%')
Number of cells labeled Benign:  357
Number of cells labeled Malignant:  212

% of cells labeled Benign: 62.74 %
% of cells labeled Malignant: 37.26 %
Among the 569 observations in our dataset, 357 (approximately 62.7%) are labeled benign, while the remaining 212 (about 37.3%) are labeled malignant. When we later develop a predictive model and evaluate it on unseen data, we should anticipate a similar distribution of labels.
Although our dataset comprises 30 feature columns (excluding 'id' and 'diagnosis'), these columns are closely related. They all describe the same 10 essential attributes, viewed from different perspectives: the mean, the standard error, and the mean of the three largest values (the "worst").
Consequently, we can gain preliminary insights by analyzing the data from just one of these perspectives. For example, we can explore the relationship between the 10 key attributes and the 'diagnosis' variable by focusing exclusively on the 'mean' columns.
To begin, let's investigate any noteworthy patterns between these 10 'mean' columns and the response variable. We'll do this by generating a scatter plot matrix, as depicted below:
# Define a list of columns to be included in the scatter plot matrix
cols = ['diagnosis',
'radius_mean',
'texture_mean',
'perimeter_mean',
'area_mean',
'smoothness_mean',
'compactness_mean',
'concavity_mean',
'concave points_mean',
'symmetry_mean',
'fractal_dimension_mean']
# Generate a pair plot (scatter plot matrix) for the selected columns
sns.pairplot(data=df[cols], hue='diagnosis', palette='RdBu')
We can observe intriguing patterns in the data. Notably, there are nearly perfect linear relationships among the radius, perimeter, and area attributes, suggesting the potential presence of multicollinearity among these variables. Similarly, another group of variables, including concavity, concave points, and compactness, also appears to hint at multicollinearity.
In the upcoming section, we will create a correlation matrix, similar to the previous visualization, but this time it will display the correlations between the variables rather than presenting them in a scatter plot. This analysis aims to statistically assess whether our hypothesis regarding multicollinearity is supported by the data.
As mentioned earlier, let's take a look at the correlations between our variables. This time, however, we will build a correlation matrix with all variables (i.e., the "mean" columns, the "se" columns, and the "worst" columns).
# Generate the correlation matrix (numeric columns only, which avoids the
# FutureWarning about the deprecated default of numeric_only)
corr = df.corr(numeric_only=True).round(2)
# Mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set figure size
f, ax = plt.subplots(figsize=(20, 20))
# Define custom colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap
sns.heatmap(corr, mask=mask, cmap=cmap, vmin=-1, vmax=1, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)
plt.tight_layout()
Looking at the matrix, we can immediately verify the presence of multicollinearity between some of our variables. For instance, the radius_mean column has correlations of 1 and 0.99 with the perimeter_mean and area_mean columns, respectively. This is likely because the three columns essentially carry the same information: the physical size of the observation (the cell). Therefore, we should keep only one of the three columns when we move into further analysis.
Multicollinearity is also apparent between the "mean" columns and the "worst" columns. For instance, the radius_mean column has a correlation of 0.97 with the radius_worst column. In fact, each of the 10 key attributes displays a very high correlation (from 0.7 up to 0.97) between its "mean" and "worst" columns. This is somewhat inevitable, because the "worst" values are themselves means, computed over the three largest values per image, so they largely track the overall means. Therefore, I think we should discard the "worst" columns from our analysis and focus only on the "mean" columns.
In short, we will drop all "worst" columns from our dataset, then keep only one of the three attributes that describe the size of cells. But which one should we pick?
Let's quickly go back to 6th grade and review some geometry. If we think of a cell as roughly circular, then its radius is, well, its radius, $r$. Its perimeter and area are then $2\pi r$ and $\pi r^2$, respectively. The radius is the basic building block of a cell's size, so it is reasonable to choose radius as the attribute representing size.
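We can sanity-check this geometric intuition with synthetic circles: perimeter is exactly linear in the radius (Pearson correlation of exactly 1), and even the quadratic area remains almost perfectly correlated with the radius over a realistic range of cell sizes. A quick sketch (the radius range here is my own assumption, roughly matching the dataset):

```python
import numpy as np

# Radii spanning roughly the range seen in the dataset
r = np.linspace(5, 30, 100)
perimeter = 2 * np.pi * r    # linear in r -> Pearson correlation exactly 1
area = np.pi * r ** 2        # quadratic in r -> still very close to 1

print(round(np.corrcoef(r, perimeter)[0, 1], 3))  # 1.0
print(round(np.corrcoef(r, area)[0, 1], 3))
```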
Similarly, there appears to be multicollinearity between the attributes compactness, concavity, and concave points. Just as we did with the size attributes, we should keep only one of these three shape-related attributes. Compactness is the most self-explanatory name of the three, so I will keep it and remove the other two.
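Rather than eyeballing the heatmap, we could also list the offending pairs programmatically. Below is a sketch; the helper name `high_corr_pairs` and the 0.9 cutoff are my own choices, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return (col_a, col_b, |corr|) for every numeric pair above the threshold."""
    corr = frame.corr(numeric_only=True).abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(corr.iloc[i, j], 2)))
    # Strongest correlations first
    return sorted(pairs, key=lambda p: -p[2])
```

Calling `high_corr_pairs(df)` on the full DataFrame would surface exactly the radius/perimeter/area and concavity/concave points/compactness clusters discussed above.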
We will now go ahead and drop all unnecessary columns.
# first, drop all "worst" columns
cols = ['radius_worst',
'texture_worst',
'perimeter_worst',
'area_worst',
'smoothness_worst',
'compactness_worst',
'concavity_worst',
'concave points_worst',
'symmetry_worst',
'fractal_dimension_worst']
df = df.drop(cols, axis=1)
# then, drop all columns related to the "perimeter" and "area" attributes
cols = ['perimeter_mean',
'perimeter_se',
'area_mean',
'area_se']
df = df.drop(cols, axis=1)
# lastly, drop all columns related to the "concavity" and "concave points" attributes
cols = ['concavity_mean',
'concavity_se',
'concave points_mean',
'concave points_se']
df = df.drop(cols, axis=1)
# verify remaining columns
df.columns
Index(['diagnosis', 'radius_mean', 'texture_mean', 'smoothness_mean',
'compactness_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'smoothness_se', 'compactness_se',
'symmetry_se', 'fractal_dimension_se'],
dtype='object')
Are we all set now?
Let's take a look at the correlation matrix once again, this time created with our trimmed-down set of variables.
# Draw the heatmap again, with the new correlation matrix
corr = df.corr(numeric_only=True).round(2)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(20, 20))
sns.heatmap(corr, mask=mask, cmap=cmap, vmin=-1, vmax=1, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)
plt.tight_layout()
Looks great! Now let's move on to our model.
It's finally time to develop our model! We will start by splitting our dataset into two parts: a training set for the model, and a test set to validate the predictions the model will make. If we skip this step, the model will be trained and tested on the same data, and it will underestimate the true error rate; an overfit model looks deceptively good on data it has already seen. It would be like sitting an exam after seeing the questions and answers beforehand. We want our model to have genuine predictive power and to label unseen data accurately. We will set the test size to 0.3; i.e., 70% of the data will go to the training set and the remaining 30% to the test set. To obtain reproducible results, we will set the random state parameter to 40.
# Split the data into training and testing sets.
# Note: 'diagnosis' stays inside X because the statsmodels formula API
# (used below) pulls both the response and the predictors from one DataFrame.
X = df
y = df['diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=40)
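One optional refinement, given the 63/37 class imbalance noted earlier, is to pass `stratify=y` so that both splits preserve the benign/malignant proportions. A sketch on synthetic labels (the toy data below is purely illustrative, not our dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data mimicking our roughly 63%/37% benign/malignant imbalance
y_toy = pd.Series(["B"] * 63 + ["M"] * 37)
X_toy = pd.DataFrame({"feature": np.arange(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=40, stratify=y_toy)

# The test set keeps (almost exactly) the same class proportions
print(y_te.value_counts(normalize=True))
```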
Now that we have split our data into appropriate sets, let's write down the formula to be used for the logistic regression.
# Create a string for the formula
cols = df.columns.drop('diagnosis')
formula = 'diagnosis ~ ' + ' + '.join(cols)
print(formula, '\n')
diagnosis ~ radius_mean + texture_mean + smoothness_mean + compactness_mean + symmetry_mean + fractal_dimension_mean + radius_se + texture_se + smoothness_se + compactness_se + symmetry_se + fractal_dimension_se
The formula includes all of the variables that were finally selected at the end of the previous section. We will now run the logistic regression with this formula and take a look at the results.
# Run the model and report the results
model = smf.glm(formula=formula, data=X_train, family=sm.families.Binomial())
logistic_fit = model.fit()
print(logistic_fit.summary())
Generalized Linear Model Regression Results
============================================================================================
Dep. Variable: ['diagnosis[B]', 'diagnosis[M]'] No. Observations: 398
Model: GLM Df Residuals: 385
Model Family: Binomial Df Model: 12
Link Function: Logit Scale: 1.0000
Method: IRLS Log-Likelihood: -55.340
Date: Thu, 15 Feb 2024 Deviance: 110.68
Time: 17:52:26 Pearson chi2: 125.
No. Iterations: 9 Pseudo R-squ. (CS): 0.6539
Covariance Type: nonrobust
==========================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------------
Intercept 44.5427 11.787 3.779 0.000 21.441 67.644
radius_mean -1.1610 0.301 -3.862 0.000 -1.750 -0.572
texture_mean -0.4237 0.087 -4.866 0.000 -0.594 -0.253
smoothness_mean -85.3981 40.976 -2.084 0.037 -165.709 -5.088
compactness_mean -16.7104 22.510 -0.742 0.458 -60.829 27.408
symmetry_mean -46.2721 17.767 -2.604 0.009 -81.095 -11.449
fractal_dimension_mean -49.1536 121.888 -0.403 0.687 -288.050 189.742
radius_se -7.1916 2.806 -2.563 0.010 -12.691 -1.692
texture_se 0.1849 0.784 0.236 0.814 -1.353 1.722
smoothness_se 163.6068 159.702 1.024 0.306 -149.403 476.616
compactness_se -31.1808 42.772 -0.729 0.466 -115.012 52.650
symmetry_se 74.7366 51.458 1.452 0.146 -26.119 175.592
fractal_dimension_se 824.1245 412.040 2.000 0.045 16.541 1631.708
==========================================================================================
Here's a breakdown of the key parts of the summary:
Great! In the next section, we will feed the test data into this model to produce label predictions. Then, we will evaluate how accurately the model has predicted the data.
In the previous section, we successfully developed a logistic regression model. This model takes unlabeled data and assigns each observation a probability between 0 and 1; this is the key feature of logistic regression. However, to evaluate whether the predictions are accurate, they must be encoded so that each one can be compared directly with the labels in the test data. In other words, instead of numbers between 0 and 1, the predictions should read "M" or "B", denoting malignant and benign respectively. In our model, a probability of 1 corresponds to the "Benign" class, whereas a probability of 0 corresponds to the "Malignant" class. Therefore, we can apply a threshold of 0.5 to our predictions, assigning values below it the label "M" and values above it the label "B".
# Predict the test data and show five sample predictions
predictions = logistic_fit.predict(X_test)
predictions[1:6]
id
848406      0.324251
907915      0.996906
911201      0.964710
84799002    0.000544
8911164     0.838719
dtype: float64
# Convert these probabilities into nominal labels and check the same five predictions again
predictions_nominal = ["M" if x < 0.5 else "B" for x in predictions]
predictions_nominal[1:6]
['M', 'B', 'B', 'M', 'B']
We can confirm that probabilities closer to 0 have been labeled as "M", while the ones closer to 1 have been labeled as "B". Now we are able to evaluate the accuracy of our predictions by checking out the classification report and the confusion matrix.
print(classification_report(y_test, predictions_nominal, digits=3))
cfm = confusion_matrix(y_test, predictions_nominal)
true_negative = cfm[0][0]
false_positive = cfm[0][1]
false_negative = cfm[1][0]
true_positive = cfm[1][1]
print('Confusion Matrix: \n', cfm, '\n')
print('True Negative:', true_negative)
print('False Positive:', false_positive)
print('False Negative:', false_negative)
print('True Positive:', true_positive)
print('Correct Predictions',
round((true_negative + true_positive) / len(predictions_nominal) * 100, 1), '%')
precision recall f1-score support
B 0.982 0.965 0.974 115
M 0.931 0.964 0.947 56
accuracy 0.965 171
macro avg 0.957 0.965 0.961 171
weighted avg 0.966 0.965 0.965 171
Confusion Matrix:
[[111 4]
[ 2 54]]
True Negative: 111
False Positive: 4
False Negative: 2
True Positive: 54
Correct Predictions 96.5 %
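As a sanity check, the headline numbers in the classification report can be recomputed by hand from the confusion-matrix counts above (here for the "M" class, which is the positive label in our layout):

```python
# Counts taken from the confusion matrix printed above
tn, fp, fn, tp = 111, 4, 2, 54

accuracy = (tn + tp) / (tn + fp + fn + tp)   # all correct / all predictions
precision_m = tp / (tp + fp)                 # predicted "M" that are truly "M"
recall_m = tp / (tp + fn)                    # truly "M" cells that we caught

print(round(accuracy, 3))     # 0.965
print(round(precision_m, 3))  # 0.931
print(round(recall_m, 3))     # 0.964
```

These match the accuracy, "M" precision, and "M" recall rows of the classification report, confirming we are reading the matrix correctly.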
Our model has accurately labeled 96.5% of the test data. This is just the beginning, however. We could try to push the accuracy even higher by using an algorithm other than logistic regression, or by trying the model with a different set of variables. There are definitely many more things that could be done to refine our model. Let's look at some of them.
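As one concrete next step, here is a sketch of how the same classifier could be fit with scikit-learn and scored by 5-fold cross-validation instead of a single train/test split. The helper `cv_accuracy` is my own wrapper, not part of the analysis above; it assumes a DataFrame shaped like ours, with a 'diagnosis' column and numeric features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def cv_accuracy(frame: pd.DataFrame, cv: int = 5) -> float:
    """Mean cross-validated accuracy of a scaled logistic regression."""
    X = frame.drop(columns="diagnosis")
    y = frame["diagnosis"]
    # Scaling inside the pipeline avoids leaking test-fold statistics
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
```

Cross-validation gives a more stable estimate of out-of-sample accuracy than our single 70/30 split, at the cost of fitting the model several times.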
In this project, we set out to develop a predictive model to classify breast cancer tumors as benign or malignant based on various features extracted from digitized images of breast mass tissue. We employed a dataset containing information on the mean, standard error, and "worst" (mean of the three largest values) of ten different characteristics of the cell nuclei, such as radius, texture, smoothness, compactness, concavity, symmetry, and fractal dimension.
In conclusion, this project provided valuable insights into the process of developing a predictive model for breast cancer classification using machine learning techniques. It demonstrated the importance of data preprocessing, exploratory analysis, feature selection, model development, and performance evaluation in the context of healthcare analytics.
Kindly find my contact details listed below for your convenience. Your input is greatly appreciated.
Ignacio Antequera Sanchez